Abstract: This paper addresses the problem of distributed learning
under communication constraints, motivated by distributed signal
processing in wireless sensor networks and data mining with dis- 
tributed databases. After formalizing a general model for distributed 
learning, an algorithm for collaboratively training regularized ker- 
nel least-squares regression estimators is derived. Noting that the 
algorithm can be viewed as an application of successive orthogonal 
projection algorithms, its convergence properties are investigated and 
the statistical behavior of the estimator is discussed in a simplified 
theoretical setting. 

I. Introduction 

In this paper, we address the problem of distributed learning un- 
der communication constraints, motivated primarily by distributed 
signal processing in wireless sensor networks (WSNs) and data 
mining with distributed databases. WSNs are designed to
make inferences from the environments they sense; however,
they are typically characterized by constraints on energy and 
bandwidth, which limit the sensors' ability to share data with each 
other or with a centralized fusion center. In data mining with 
distributed databases, multiple agents (e.g., corporations) have 
access to possibly overlapping databases, and wish to collabo- 
rate to make optimal inferences; privacy or security concerns, 
however, may preclude them from fully sharing information. 
Nonparametric methods studied within machine learning have 
demonstrated widespread empirical success in many centralized 
(i.e., communication unconstrained) signal processing applica- 
tions. Thus, in both the aforementioned applications, a natural 
question arises: can the power of machine learning methods be 
tapped for nonparametric inference in distributed learning under 
communication constraints? 

In this paper, we address this question by formalizing a general 
model for distributed learning, and then deriving a distributed 
algorithm for collaborative training in regularized kernel least- 
squares regression. The algorithm can be viewed as an instan- 
tiation of successive orthogonal projection algorithms, and thus, 
insight into the statistical behavior of these algorithms can be 
gleaned from standard analyses in mathematical programming. 

A. Related Work 

Distributed learning has been addressed in a variety of other 
works. Reference [9] considered a PAC-like model for learning 
with many individually trained hypotheses in a distribution- 
specific learning framework. Reference [13] considered the clas- 
sical model for decentralized detection [17] in a nonparamet- 
ric setting. Reference [15] studied the existence of consistent 
estimators in several models for distributed learning. From a
data mining perspective, [6] and [12] derived algorithms for 
distributed boosting. Most similar to the research presented here, 
[7] presented a general framework for distributed linear regression 
motivated by WSNs. 

Ongoing research in the machine learning community seeks 
to design statistically sound learning algorithms that scale to 
large data sets (e.g., [3] and references therein). One approach 
is to decompose the database into smaller "chunks", and sub- 
sequently parallelize the learning process by assigning distinct 
processors/agents to each of the chunks. In principle, algorithms 
for parallelizing learning may be useful for distributed learning, 
and vice-versa. To our knowledge, there has not been an attempt 
to parallelize reproducing kernel methods using the approach 
outlined below. 

A related area of research lies in the study of ensemble meth- 
ods in machine learning; examples of these techniques include 
bagging, boosting, and mixtures of experts (e.g., [5] and others). 
Typically, the focus of these works is on the statistical and 
algorithmic advantages of learning with an ensemble and not on 
the problem of learning under communication constraints. To our 
knowledge, the methods derived here have not been derived in 
this related context, though future work in distributed learning 
may benefit from the many insights gleaned from this important 
area. 

Those familiar with the online learning framework may find 
our collaborative training algorithm reminiscent of the equations 
for additive gradient updates [11]. Though both algorithms may 
be interpreted in the context of successive orthogonal projection 
algorithms, it does not appear possible to specialize the current 
model for distributed learning in a way that recovers the online 
learning framework (or vice versa). 

The research presented here generalizes the model and algo- 
rithm discussed in [14], which focused exclusively on the WSN 
application. Distinctions between the current and former work are 
discussed in more detail below. 

B. Organization 

The remainder of this paper is organized as follows. In Section 
II, we review preliminary background information necessary for 
the remainder of the work. In Section III, we describe a general 
model for distributed learning and propose a distributed algorithm
for collaboratively training regularized kernel least-squares
regression estimators. Subsequently, we analyze the algorithm's 
convergence properties and use these properties to gain insight 
into the statistical behavior of the estimator in a simplified setting. 
We conclude with a discussion of the method in Section IV. 

II. Preliminaries 

In this section, we briefly review the supervised learning model 
for nonparametric least-squares regression, reproducing kernel 
methods, and alternating projection algorithms. Since a thorough
introduction to these models and methods is beyond the scope of 
this paper, we refer the reader to standard references on the topics; 
see, for example, [4], [8], [16] and references therein. 

A. Nonparametric Least-squares Regression 

Let $X$ and $Y$ be $\mathcal{X}$- and $\mathcal{Y}$-valued random variables, respectively.
$\mathcal{X}$ is known as the feature, input, or observation space; $\mathcal{Y}$ is known
as the label, output, or target space. For now, we allow $\mathcal{X}$ to be
arbitrary, but take $\mathcal{Y} = \mathbb{R}$. In the least-squares estimation problem,
we seek a decision rule mapping inputs to outputs that minimizes
the expected squared error. In particular, we seek a function
$g : \mathcal{X} \to \mathcal{Y}$ that minimizes

$$E\{|g(X) - Y|^2\}.$$

It is well-known that $\eta(x) = E\{Y \mid X = x\}$ is the loss minimizing
rule. However, without prior knowledge of the joint distribution
of $(X, Y)$, this regression function cannot be computed. In the
supervised learning model, one is instead provided a database
$S = \{(x_i, y_i)\}_{i=1}^{n}$ of training examples with
$(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$ $\forall i \in \{1, \ldots, n\}$;
the learning task is to use $S$ to estimate $\eta(x)$.

B. Regularized Kernel Methods 

Regularized kernel methods [16] offer one approach to nonparametric
regression. In particular, let $\mathcal{H}_K$ denote the reproducing
kernel Hilbert space (RKHS) induced by a positive semi-definite
kernel $K(\cdot, \cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$; let $\|\cdot\|_{\mathcal{H}_K}$ denote the norm associated
with $\mathcal{H}_K$. In practice, the kernel $K$ is a design parameter, chosen
as a similarity measure between inputs to reflect prior application-specific
domain knowledge. The regularized kernel least-squares
estimate is defined as the solution $f_\lambda \in \mathcal{H}_K$ of the following
optimization problem:

$$\min_{f \in \mathcal{H}_K} \left[ \sum_{i=1}^{n} (f(x_i) - y_i)^2 + \lambda \|f\|_{\mathcal{H}_K}^2 \right]. \qquad (1)$$

The statistical behavior of this estimator is well-understood
under various assumptions on the stochastic process that generates
the examples $\{(x_i, y_i)\}_{i=1}^{n}$ [16], [19]. In this paper, we focus
primarily on algorithmic aspects of computing a solution to (1) (or
an approximation thereof) in distributed environments. To this end,
consider the following "Representer Theorem" proved originally
in [10].

Theorem 1 ([10]): Let $f_\lambda \in \mathcal{H}_K$ be the minimizer of (1). Then,
there exists $c_\lambda \in \mathbb{R}^n$ such that

$$f_\lambda(\cdot) = \sum_{i=1}^{n} c_{\lambda,i} K(\cdot, x_i).$$



From a computational perspective, the result is significant because 
it states that while the objective function in (1) is defined over a
potentially infinite dimensional Hilbert space, its minimizer must 
lie in a finite dimensional subspace. 
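
As a minimal sketch of how Theorem 1 is used computationally, the following Python fragment fits the estimator by solving for the coefficient vector. It assumes the unnormalized objective in (1) (sum of squared errors plus $\lambda \|f\|_{\mathcal{H}_K}^2$) and NumPy; substituting the expansion of Theorem 1 into (1) yields a finite-dimensional problem, one solution of which is $c_\lambda = (\mathbf{K} + \lambda I)^{-1} y$ with $\mathbf{K}_{ij} = K(x_i, x_j)$. The function names and the choice of a linear kernel are illustrative assumptions, not part of the original development.

```python
import numpy as np

def linear_kernel(A, B):
    """Gram matrix of the linear kernel K(x, x') = x^T x' between rows of A and B."""
    return A @ B.T

def fit_kernel_ridge(X, y, lam, kernel):
    """Return coefficients c with f(.) = sum_i c[i] K(., x_i), per Theorem 1."""
    K = kernel(X, X)
    # One minimizer of ||K c - y||^2 + lam * c^T K c is c = (K + lam I)^{-1} y.
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def predict(X_train, coef, X_new, kernel):
    """Evaluate f at the rows of X_new."""
    return kernel(X_new, X_train) @ coef

# Illustrative usage on synthetic data (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
coef = fit_kernel_ridge(X, y, lam=0.7, kernel=linear_kernel)
print(predict(X, coef, X[:2], linear_kernel))
```

Only an $n \times n$ linear system is solved, regardless of the dimension of $\mathcal{H}_K$, which is the computational content of the theorem.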

Finally, note that (1) can be naturally interpreted as an orthogonal
projection. In particular, by introducing an auxiliary vector
$z \in \mathbb{R}^n$, (1) can be rewritten as the following optimization
program:

$$\min \; \|z - y\|_2^2 + \lambda \|f\|_{\mathcal{H}_K}^2 \qquad (2)$$
$$\text{s.t.} \;\; z_i = f(x_i) \;\; \forall i \in \{1, \ldots, n\} \qquad (3)$$
$$z \in \mathbb{R}^n, \;\; f \in \mathcal{H}_K.$$
Through the constraints in (3), (1) and (2) are equivalent in the
following sense: if $f_\lambda$ is the minimizer of (1) and $(z', f'_\lambda)$ is
the solution of (2), then $f'_\lambda = f_\lambda$. Therefore, through (2), we
can interpret the regularized kernel least-squares estimator as a
projection of the vector $(y, 0) \in \mathbb{R}^n \times \mathcal{H}_K$ onto the set

$$\{(z, f) \in \mathbb{R}^n \times \mathcal{H}_K : z_i = f(x_i) \;\forall i \in \{1, \ldots, n\}\} \subset \mathbb{R}^n \times \mathcal{H}_K.$$

This simple observation will recur in the sequel.

C. Alternating Projections Algorithms 

Let $\mathcal{X}$ be a Hilbert space with a norm denoted by $\|\cdot\|$. Let
$C_1, \ldots, C_m$ be closed convex subsets of $\mathcal{X}$ whose intersection
$C = \cap_{i=1}^{m} C_i$ is nonempty. Let $P_C(x)$ denote the orthogonal
projection of $x \in \mathcal{X}$ onto $C$, i.e.,

$$P_C(x) = \arg\min_{x' \in C} \|x - x'\|.$$

Define $P_{C_i}(x)$ analogously.

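Successive orthogonal projection (SOP) algorithms [4] provide
a natural way to compute $P_C(\cdot)$ given $\{P_{C_i}(\cdot)\}_{i=1}^{m}$. For example,
the (unrelaxed) SOP algorithm is defined as follows: starting from
an initial point $x_0 \in \mathcal{X}$,

$$x_n = P_{C_{(n \bmod m) + 1}}(x_{n-1}). \qquad (4)$$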
In words, the algorithm successively and iteratively projects
onto each of the subsets. In the case where $C_i$ is a linear subspace
for all $i \in \{1, \ldots, m\}$, this algorithm was first studied by von
Neumann [18]. Often examined in the context of the convex
feasibility problem, SOP has been generalized in various ways
[4] to address more general convex sets and non-orthogonal (e.g.,
Bregman) projections; accordingly, the algorithm often takes on
other names (e.g., the von Neumann-Halperin algorithm, Bregman's
algorithm). Much of the behavior of this algorithm can
be understood through Theorem 2; the proof of this fundamental
result can be found in [2].

Theorem 2: Let $\{C_i\}_{i=1}^{m}$ be a set of closed, convex subsets of
$\mathcal{X}$ whose intersection $C = \cap_{i=1}^{m} C_i$ is nonempty. Let $x_n$ be defined
as in (4), with initial point $x_0$. Then, for every $x \in C$ and every $n \ge 1$,

$$\|x_n - x\| \le \|x_{n-1} - x\|.$$

Moreover, $\lim_{n \to \infty} x_n \in \cap_{i=1}^{m} C_i$. If the $C_i$ are affine for all $i \in
\{1, \ldots, m\}$, then $\lim_{n \to \infty} \|x_n - P_C(x_0)\| = 0$.
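
To make the SOP iteration and the two conclusions of Theorem 2 concrete, the following Python sketch runs cyclic projections onto two affine sets in $\mathbb{R}^3$. It is a toy illustration under stated assumptions: each set is written as $\{u : Au = b\}$ with $A$ of full row rank, the function names are hypothetical, and the 0-based index `n % m` plays the role of $(n \bmod m) + 1$.

```python
import numpy as np

def project_onto_affine(x, A, b):
    """Orthogonal projection of x onto {u : A u = b}; A is assumed to have full row rank."""
    return x - A.T @ np.linalg.solve(A @ A.T, A @ x - b)

def sop(x0, constraints, n_iters):
    """Unrelaxed cyclic SOP; constraints is a list of (A, b) pairs, one per set C_i."""
    m = len(constraints)
    x = np.asarray(x0, dtype=float).copy()
    history = [x.copy()]
    for n in range(1, n_iters + 1):
        A, b = constraints[n % m]  # 0-based counterpart of C_{(n mod m) + 1}
        x = project_onto_affine(x, A, b)
        history.append(x.copy())
    return x, history

# Two planes in R^3 whose intersection is the line {(1, 1, t) : t real}.
constraints = [
    (np.array([[1.0, 0.0, 0.0]]), np.array([1.0])),
    (np.array([[1.0, 1.0, 0.0]]), np.array([2.0])),
]
x0 = np.array([3.0, -2.0, 5.0])
x_final, history = sop(x0, constraints, n_iters=200)
print(x_final)
```

In this example the iterates approach $(1, 1, 5)$, the orthogonal projection of $x_0$ onto the intersection, and the distance from the iterates to any fixed point of the intersection never increases, in line with the theorem.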

III. Distributed Kernel Regression 
A. The Model 

In contrast to the model for supervised learning reviewed in
Section II, suppose that each member of a collection of $m$
learning agents has limited access to the training database
$S = \{(x_i, y_i)\}_{i=1}^{n}$. In particular, assume that learning agent $i$ has access
only to the training examples in subset $S_i \subseteq S$. For convenience,
we shall henceforth refer to $\{S_i\}_{i=1}^{m}$ as an ensemble.

A bipartite graph is a convenient way to represent an ensemble 
in this model for distributed regression. As depicted in Figure 1, 
nodes on the top-level of the graph represent learning agents; 
nodes on the bottom-level represent training examples. An edge 
between a learning agent i and a training sample j signifies that 
agent $i$ has access to example $j$, i.e., $(x_j, y_j) \in S_i$. For now,
we make no additional assumptions on the structural relationship
between the agents' locally accessible training sets; for example,
we do not require the ensemble $\{S_i\}_{i=1}^{m}$ to partition $S$, nor do we
require the corresponding bipartite graph to be connected in any
way.

To be concrete, consider a few examples that illustrate special
cases of the general model depicted in Figure 1. The standard
centralized model for supervised learning can be represented by
the graph in Figure 2, where each of the $m$ learning agents has
access to all exemplars in the training database. Figure 3 illustrates
an ensemble where a public database is available to
all the learning agents, each of which also retains a private training
set. In some applications, $\mathcal{X}$ may be endowed with a topology.
For example, in wireless sensor networks, $\mathcal{X} = \mathbb{R}^2$ may model
locations in a city; learning agents (i.e., sensors) may exist as
points within $\mathcal{X}$, and query those examples that are "nearby" with
respect to the underlying topology; such an ensemble is depicted
in Figure 4.
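
The three special cases above can be sketched as index sets over the training database, one per agent. The Python helpers below are hypothetical constructions for illustration only: examples are numbered from 0, the Euclidean distance with a fixed radius stands in for "nearby", and the boolean matrix returned by `biadjacency` is one possible encoding of the bipartite graph.

```python
import numpy as np

def centralized_ensemble(n, m):
    """Every one of the m agents sees all n examples (Figure 2)."""
    return [np.arange(n) for _ in range(m)]

def public_database_ensemble(n_public, private_sizes):
    """Agents share the first n_public examples and each keeps a private block (Figure 3)."""
    public = np.arange(n_public)
    sets, start = [], n_public
    for size in private_sizes:
        private = np.arange(start, start + size)
        sets.append(np.concatenate([public, private]))
        start += size
    return sets

def neighborhood_ensemble(agent_positions, example_positions, radius):
    """Each agent sees the examples within a given Euclidean radius (Figure 4)."""
    diffs = agent_positions[:, None, :] - example_positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    return [np.flatnonzero(row <= radius) for row in dists]

def biadjacency(index_sets, n):
    """Boolean agents-by-examples matrix: entry (i, j) is True iff agent i sees example j."""
    A = np.zeros((len(index_sets), n), dtype=bool)
    for i, idx in enumerate(index_sets):
        A[i, idx] = True
    return A

sets = public_database_ensemble(n_public=2, private_sizes=[1, 1, 1])
print(biadjacency(sets, n=5).astype(int))
```

Nothing in these helpers requires the index sets to partition the database or the graph to be connected, matching the generality of the model.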

As mentioned earlier, the current model is a generalization of 
the work discussed in [14]. Whereas [14] focuses exclusively
on the WSN application by assuming a topology on X and 
by modeling one agent per training observation, the present 
formulation allows a more general structure with multiple agents 
per training datum and an arbitrary input space. 



Fig. 1. A Bipartite Graph Representation of an Ensemble in this Model for
Distributed Regression (top level: learning agents; bottom level: training database)

Fig. 2. A "Centralized" Ensemble

Fig. 3. An Ensemble with a Public Database

Fig. 4. A Sensor Network: An Ensemble with Topology Dependent Structure

Presumably, each of the $m$ agents wishes to use nonparametric
methods to estimate the regression function. One simple approach
is for agent $i$ to compute $f_{\lambda_i}$ using only the exemplars in its
local training database $S_i$. However, doing so ignores the structure
of distributed regression and fails to exploit an opportunity to
collaborate using the (partially) shared training database.

We henceforth assume that after locally computing $f_{\lambda_i} \in \mathcal{H}_K$,
agent $i$ may share $f_{\lambda_i}(x_j) \in \mathbb{R}$ with any agent $k$ such that
$(x_j, y_j) \in S_k$. In other words, neighboring agents (with respect to
the bipartite graph) communicate point estimates for the training
data they share. Using such limited communication, can the agents
collaborate to jointly improve the accuracy of their estimates?

In the next section, we derive a collaborative training algo- 
rithm in this model for distributed nonparametric regression. The 
algorithm is derived as an application of SOP algorithms applied 
to a relaxation of the classical regularized kernel least-squares 
estimator. Subsequently, we analyze its convergence properties 
and investigate its statistical properties in a simplified theoretical 
setting. 

B. A Collaborative Training Algorithm 

For technical convenience, let us introduce index sets $\{\bar{S}_i\}_{i=1}^{m}$ such
that $\bar{S}_i \subseteq \{1, \ldots, n\}$. Let $j \in \bar{S}_i$ if and only if $(x_j, y_j) \in S_i$. In
other words, $\bar{S}_i$ contains the indices of the training examples in
$S_i$ as enumerated in $S$. Analogously, let $\bar{S} = \{1, \ldots, n\}$.

To begin, let us rewrite (1) in a way that reveals the structure
of distributed regression. To do so, first let us introduce a function
$f_i \in \mathcal{H}_K$ for each agent $i \in \{1, \ldots, m\}$, and consider the
following constrained optimization program:

$$\min \; \|z - y\|_2^2 + \sum_{i=1}^{m} \lambda_i \|f_i\|_{\mathcal{H}_K}^2 \qquad (5)$$
$$\text{s.t.} \;\; z_j = f_i(x_j) \;\; \forall j \in \bar{S}, \; i \in \{1, \ldots, m\} \qquad (6)$$
$$f_i \in \mathcal{H}_K \;\; \forall i \in \{1, \ldots, m\}.$$

Here, the optimization variables are $z \in \mathbb{R}^n$ and $\{f_i\}_{i=1}^{m} \subset
\mathcal{H}_K$; $S = \{(x_i, y_i)\}_{i=1}^{n}$ and $\{\lambda_i\}_{i=1}^{m} \subset \mathbb{R}$ are the program
data. The coupling constraints in (6) dictate that for any feasible
solution to (5), every agent's associated function is equivalent
when evaluated at $\{x_i\}_{i=1}^{n}$. As a result, one can think about (5)
as an equivalent form of (1) in the following sense.

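Lemma 1: Let $(z, f_{\lambda_1}, \ldots, f_{\lambda_m}) \in \mathbb{R}^n \times \mathcal{H}_K^m$ denote the solution
of (5) and let $f_\lambda \in \mathcal{H}_K$ denote the solution of (1). Assume that
$\lambda_i > 0$ $\forall i \in \{1, \ldots, m\}$. Then, $f_{\lambda_1} = \cdots = f_{\lambda_m}$. If $\sum_{i=1}^{m} \lambda_i = \lambda$,
then $f_\lambda = f_{\lambda_i}$ for all $i$.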

This form of the regularized least-squares regression problem 
suggests a natural relaxation that allows us to incorporate the 
structure of the distributed regression model into the estimator. In 
particular, we relax the coupling constraints to require that agents 
agree only on training examples they share: 

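$$\min \; \|z - y\|_2^2 + \sum_{i=1}^{m} \lambda_i \|f_i\|_{\mathcal{H}_K}^2 \qquad (7)$$
$$\text{s.t.} \;\; z_j = f_i(x_j) \;\; \forall j \in \bar{S}_i, \; i \in \{1, \ldots, m\} \qquad (8)$$
$$f_i \in \mathcal{H}_K \;\; \forall i \in \{1, \ldots, m\}.$$

Thus, for any feasible solution to (7), $f_i(x_j) = f_k(x_j)$ if
$(x_j, y_j) \in S_i \cap S_k$. Looked at in this way, (5) models the
"centralized ensemble" depicted in Figure 2, while (7) captures
the more general structure in Figure 1.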

Note that just as (1) can be interpreted as a projection via (2),
(7) can be interpreted as a (weighted) projection of the vector
$(y, 0, \ldots, 0) \in \mathbb{R}^n \times \mathcal{H}_K^m$ onto the set $C = \cap_{i=1}^{m} C_i$, with

$$C_i = \{(z, f_1, \ldots, f_m) : f_i(x_j) = z_j \;\forall j \in \bar{S}_i, \; z \in \mathbb{R}^n, \; \{f_i\}_{i=1}^{m} \subset \mathcal{H}_K\} \subset \mathbb{R}^n \times \mathcal{H}_K^m. \qquad (9)$$

The significance of this observation lies in the fact that the relaxed
form of the regularized kernel least-squares estimator has been
expressed as a projection onto the intersection of a collection of
$m$ convex sets; in particular, note that each set $C_i$ is a subspace.
Thus, by Theorem 2, the SOP algorithm can be used to solve the
relaxed problem (7). Moreover, computing $P_{C_i}(\cdot)$ requires agent
$i$ to gather examples only within its locally accessible database.
More precisely, note that for any $v = (z, f_1, \ldots, f_m) \in \mathbb{R}^n \times \mathcal{H}_K^m$,
$P_{C_i}(v) = (z^*, f_1^*, \ldots, f_m^*)$ where

$$f_j^* = f_j \;\; \forall j \ne i$$
$$f_i^* = \arg\min_{f \in \mathcal{H}_K} \sum_{j \in \bar{S}_i} (f(x_j) - z_j)^2 + \lambda_i \|f - f_i\|_{\mathcal{H}_K}^2$$
$$z_j^* = z_j \;\; \forall j \text{ s.t. } j \notin \bar{S}_i$$
$$z_j^* = f_i^*(x_j) \;\; \forall j \text{ s.t. } j \in \bar{S}_i.$$

To emphasize, computing $P_{C_i}(v)$ leaves $z_j$ unchanged for all
$j \notin \bar{S}_i$ and leaves $f_j$ unchanged for all $j \ne i$. The function
associated with agent $i$, $f_i^*$, can be computed using $f_i$ and $S_i$
after the training data labels $\{y_j\}_{j \in \bar{S}_i}$ have been updated with
the corresponding "message variables" $\{z_j\}_{j \in \bar{S}_i}$. Tying these
observations together, we are left with an algorithm for collaborative
regression estimation which solves a relaxed form of the
regularized least-squares estimator (7).

The algorithm is summarized in pseudo-code in Table I and
depicted pictorially in Figure 5. In words, the algorithm iterates 
over each agent in turn, allowing them to compute a local 
kernel estimate and to update the labels in the training database 
accordingly. Multiple passes (in fact, T cycles) over the agents 
are made. 
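
The sweep over agents described above can be sketched in Python as follows. This is a hedged illustration rather than a reference implementation: it assumes each agent starts from the zero function (consistent with projecting the vector $(y, 0, \ldots, 0)$), stores each agent's function through coefficients on its local kernel sections, and uses the closed form obtained by applying Theorem 1 to the increment $f - f_{i,t-1}$, namely a coefficient update of $(\mathbf{K}_i + \lambda_i I)^{-1}(z_{\bar{S}_i} - \mathbf{K}_i c_i)$ with $\mathbf{K}_i$ the local Gram matrix. The Gaussian kernel, the function names, and the 0-based index sets are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel between the rows of A and B (illustrative choice)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def collaborative_train(X, y, index_sets, lambdas, kernel, T=50):
    """T sweeps of local regularized fits with label updates, one agent at a time."""
    index_sets = [np.asarray(idx, dtype=int) for idx in index_sets]
    z = np.asarray(y, dtype=float).copy()                # message variables, z = y at start
    grams = [kernel(X[idx], X[idx]) for idx in index_sets]
    coefs = [np.zeros(len(idx)) for idx in index_sets]   # assumption: every agent starts at f = 0
    for _ in range(T):
        for i, idx in enumerate(index_sets):
            K_i = grams[i]
            residual = z[idx] - K_i @ coefs[i]           # current labels minus previous local fit
            step = np.linalg.solve(K_i + lambdas[i] * np.eye(len(idx)), residual)
            coefs[i] = coefs[i] + step                   # regularized move away from f_{i,t-1}
            z[idx] = K_i @ coefs[i]                      # write updated estimates to the database
    return coefs, z

# Illustrative usage: two agents sharing examples 2 and 3 (hypothetical data).
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
y = rng.normal(size=6)
sets = [np.array([0, 1, 2, 3]), np.array([2, 3, 4, 5])]
coefs, z = collaborative_train(X, y, sets, [0.3, 0.3], gaussian_kernel, T=20)
print(z)
```

Agent $i$'s estimate at a new point $x$ is then the sum over $j \in \bar{S}_i$ of its coefficients times $K(x, x_j)$. With a single agent that sees every example, the first pass already returns the centralized estimate and later passes leave it unchanged.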




Fig. 5. A Collaborative Training Algorithm 

C. Convergence 

Note that the asymptotic behavior of the collaborative training 
algorithm is implied by the analysis of the SOP algorithm. In 
particular, we have the following. 

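Theorem 3: Let $(z, f_{\lambda_1}, \ldots, f_{\lambda_m}) \in \mathbb{R}^n \times \mathcal{H}_K^m$ be the solution
to (7) and let $\{f_{i,T}\}_{i=1}^{m} \subset \mathcal{H}_K$ be as defined in the algorithm
described in Table I. Then,

$$\lim_{T \to \infty} f_{i,T} = f_{\lambda_i}$$

for all $i \in \{1, \ldots, m\}$.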

This theorem follows from Theorem 2 and the fact that convergence
in norm implies pointwise convergence in RKHSs.



Given the structure of RKHS and the general analysis in [2], the 
algorithm is expected to converge linearly for many kernels. We 
forego a discussion of this important, but technical point for the 
sake of space. 

Observe that Theorem 3 characterizes the output of the collaborative
training algorithm relative to (7). This characterization
is useful insofar as it sheds light on the relationship between
the algorithm's output and (1), the centralized regularized least-squares
estimator. The following straightforward generalization of
Theorem 1 is a step toward further understanding this important
relationship.

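Theorem 4: Let $(z, f_{\lambda_1}, \ldots, f_{\lambda_m}) \in \mathbb{R}^n \times \mathcal{H}_K^m$ be the solution
to (7). Then, for every agent $i \in \{1, \ldots, m\}$, there exists $c_{\lambda_i} \in
\mathbb{R}^{|\bar{S}_i|}$ such that

$$f_{\lambda_i}(\cdot) = \sum_{j \in \bar{S}_i} c_{\lambda_i, j} K(\cdot, x_j). \qquad (10)$$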

The proof of this theorem follows from the original Representer
Theorem (applied to the update equation for $f_{i,t}$) and the fact that
$\mathcal{H}_K$ is closed.

The significance of Theorem 4 lies in the fact that the size 
of any agent's locally accessible database fundamentally limits 
the accuracy of that agent's estimate. In particular, an agent 
having access to only a few exemplars in an otherwise large 
training database will still be limited to estimates that lie in 
the span of functions determined by its local data; thus, local 
connectivity influences the agent's bias. Intuitively, however, the 
message-passing through the training database may optimize the 
estimator within that limited span if the ensemble is "connected" 
in some meaningful way. To bear out this intuition in a simplified 
theoretical setting, we consider a simple notion of connectedness 
in the next section. 

D. A Simplified Setting 

For a given ensemble, kernel pair $(\{S_i\}_{i=1}^{m}, K)$, let us construct
an auxiliary graph as follows: let there be a node for every learning
agent and let there be an edge between node (i.e., agent) $i$ and
node $k$ if the following condition holds:

$$\mathrm{span}(\{K(\cdot, x_j)\}_{j \in \bar{S}_i}) = \mathrm{span}(\{K(\cdot, x_j)\}_{j \in \bar{S}_k}) = \mathrm{span}(\{K(\cdot, x_j)\}_{j \in \bar{S}_i \cap \bar{S}_k}). \qquad (11)$$

In other words, an edge connects two nodes if the training exam- 
ples they share determine the space of functions their estimates 
lie in as dictated by Theorem 4. 

Definition 1: Let us call the ensemble, kernel pair
$(\{S_i\}_{i=1}^{m}, K)$ connected if and only if the auxiliary graph
so constructed is connected.
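
Condition (11) and Definition 1 can be checked numerically, as in the following sketch. It rests on two observations: the dimension of the span of a set of kernel sections equals the rank of the corresponding Gram matrix, and the span over the shared examples is contained in each agent's span, so the spans coincide exactly when the ranks agree. The function names are hypothetical, and the use of `numpy.linalg.matrix_rank` with its default tolerance is an assumption that may need adjusting for ill-conditioned kernels.

```python
import numpy as np

def span_dim(K, idx):
    """Dimension of span{K(., x_j) : j in idx}, computed as the rank of the Gram submatrix."""
    idx = np.asarray(idx, dtype=int)
    if idx.size == 0:
        return 0
    return int(np.linalg.matrix_rank(K[np.ix_(idx, idx)]))

def auxiliary_graph(K, index_sets):
    """Adjacency matrix with an edge (i, k) whenever condition (11) holds."""
    m = len(index_sets)
    adj = np.zeros((m, m), dtype=bool)
    for i in range(m):
        for k in range(i + 1, m):
            shared = np.intersect1d(index_sets[i], index_sets[k])
            r = span_dim(K, shared)
            # Nested spans with equal dimension are equal.
            if r == span_dim(K, index_sets[i]) and r == span_dim(K, index_sets[k]):
                adj[i, k] = adj[k, i] = True
    return adj

def is_connected(adj):
    """Depth-first search from node 0 over the auxiliary graph."""
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in np.flatnonzero(adj[u]):
            if int(v) not in seen:
                seen.add(int(v))
                stack.append(int(v))
    return len(seen) == adj.shape[0]

# Linear kernel on R^2 with two shared, linearly independent examples (indices 0 and 1).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [0.0, 3.0]])
K = X @ X.T
print(is_connected(auxiliary_graph(K, [[0, 1, 2], [0, 1, 3], [0, 1, 4]])))
```

In the usage example the shared examples already span all linear functions on $\mathbb{R}^2$, so every pair of agents is joined by an edge.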

This definition leads to the following theorem, which can be 
viewed as a straightforward generalization of Lemma 1.

Theorem 5: Let $(\{S_i\}_{i=1}^{m}, K)$ be connected and suppose the
ensemble employs the collaborative training algorithm using
$\{\lambda_i\}_{i=1}^{m}$. Finally, let $f_\lambda$ denote the solution to (1) for $\lambda =
\sum_{i=1}^{m} \lambda_i$. Then,

$$f_\lambda = \lim_{T \to \infty} f_{i,T} \qquad (12)$$

for all $i \in \{1, \ldots, m\}$.

Init:  Agents agree on a positive semi-definite kernel $K(\cdot, \cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$.
       Training database $S = \{(x_i, z_i)\}_{i=1}^{n}$ is initialized
       so that $z_i = y_i$ $\forall i \in \{1, \ldots, n\}$.

Train: for $t = 1, \ldots, T$
         for $i = 1, \ldots, m$
           Agent $i$:
             Retrieves database $S_i \subseteq S$
             Computes $f_{i,t} := \arg\min_{f \in \mathcal{H}_K} [ \sum_{j \in \bar{S}_i} (f(x_j) - z_j)^2 + \lambda_i \|f - f_{i,t-1}\|_{\mathcal{H}_K}^2 ]$
             Updates database: $z_j \leftarrow f_{i,t}(x_j)$ $\forall (x_j, z_j) \in S_i$
         end
       end

TABLE I
An Algorithm for Training Collaboratively

Theorem 5 follows from Theorem 3 after noting that connectedness
implies that the solution to (7), $(z, f_{\lambda_1}, \ldots, f_{\lambda_m})$, satisfies
$f_{\lambda_1} = \cdots = f_{\lambda_m}$. To illustrate the significance of Theorem 5
and to tie it to the foregoing discussion, consider the following
example.

Example 1: Suppose $\mathcal{X} = \mathbb{R}^d$ and that $K(x, x') = x^T x'$ is
the linear kernel; in this case, $\mathcal{H}_K$ is the set of linear functions
on $\mathcal{X}$. If $\{S_i\}_{i=1}^{m}$ is an ensemble with a public database of $d$
linearly independent examples (depicted in Figure 3 and discussed
in Section III), then $(\{S_i\}_{i=1}^{m}, K)$ is connected. Therefore, by
Theorem 5, the collaborative training algorithm would allow agent
$i$ to find the best linear fit to the entire data set $S$ (for the
particular choice of regularization parameter $\lambda$), despite the fact
that only $100\,|S_i|/n$ percent of the data is locally accessible.
More generally, if a $p$th order polynomial kernel is used, then
an analogous observation holds when $d^p$ examples are shared.
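
The claim of Example 1 can be illustrated numerically with a short Python sketch. For the linear kernel, each agent's function is a weight vector $w_i$ with $\|f_i\|_{\mathcal{H}_K} = \|w_i\|$, so the local update of Table I can be written in primal form as $w_i \leftarrow w_i + (X_i^T X_i + \lambda_i I)^{-1} X_i^T (z_{\bar{S}_i} - X_i w_i)$, where $X_i$ stacks the agent's local inputs. The synthetic data, the number of sweeps, the starting point $w_i = 0$, and the function name are assumptions made for illustration; every example is assigned to at least one agent.

```python
import numpy as np

def collaborative_linear(X, y, index_sets, lambdas, T):
    """Collaborative training specialized to the linear kernel, in primal (weight) form."""
    n, d = X.shape
    z = np.asarray(y, dtype=float).copy()     # message variables, z = y at start
    W = np.zeros((len(index_sets), d))        # assumption: every agent starts at w_i = 0
    for _ in range(T):
        for i, idx in enumerate(index_sets):
            Xi = X[idx]
            resid = z[idx] - Xi @ W[i]
            W[i] += np.linalg.solve(Xi.T @ Xi + lambdas[i] * np.eye(d), Xi.T @ resid)
            z[idx] = Xi @ W[i]
    return W, z

rng = np.random.default_rng(0)
d, m, q = 3, 4, 3                             # input dimension, agents, private examples per agent
X = np.vstack([np.eye(d), rng.normal(size=(m * q, d))])   # first d rows: public database
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=d + m * q)
sets = [np.concatenate([np.arange(d), d + q * i + np.arange(q)]) for i in range(m)]
lambdas = [0.5] * m

W, z = collaborative_linear(X, y, sets, lambdas, T=2000)
w_central = np.linalg.solve(X.T @ X + sum(lambdas) * np.eye(d), X.T @ y)
print(np.abs(W - w_central).max())
```

Each agent holds only $d + q$ of the $d + mq$ examples, yet under these assumptions every row of `W` should approach the centralized ridge solution with $\lambda = \sum_i \lambda_i$, which is what the printed maximum deviation measures.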

In this simple example, the potential utility of the collaborative 
training algorithm is revealed. Consider the extreme case when 
each agent has access to only a single example in addition 
to the public database. As the number of agents $m \to \infty$,
the collaborative training algorithm would allow every agent a 
consistent estimate of the optimal linear least-squares estimate as 
long as $\sum_{i=1}^{m} \lambda_i \to 0$; this is true despite the fact that each agent
retains local access to only d + 1 examples for all m. 

IV. Discussion 

As described in Table I, the inner loop of the collaborative
training algorithm iterates over agents in the ensemble serially. 
Note that the ordering is non-essential and parallelism may be 
introduced. In fact, two agents can train simultaneously as long 
as they do not share exemplars in their locally accessible training 
database. In practical settings, multiple-access algorithms that 
are frequently studied in the communications literature (e.g., 
ALOHA) may be adapted to negotiate an ordering in a distributed 
fashion. Since the SOP algorithm and Theorem 2 have been
extended to a very general class of (perhaps random) control
orderings [2], Theorem 3 can be extended in many cases. Experi- 
ments that validate the collaborative training algorithm in a WSN 
setting can be found in [14]. 
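
As one hedged illustration of the parallelism noted above, the following Python sketch groups agents so that no two agents in a group share an exemplar; agents within a group could then train simultaneously, with groups processed in sequence. The first-fit grouping rule and the function names are assumptions chosen for simplicity, not a scheme proposed in this paper, and the resulting order differs from the serial order of Table I.

```python
def share_examples(a, b):
    """True if index sets a and b have at least one training example in common."""
    return len(set(a) & set(b)) > 0

def parallel_schedule(index_sets):
    """First-fit grouping: place each agent in the first group with which it has no conflict."""
    groups = []
    for i, idx in enumerate(index_sets):
        for group in groups:
            if all(not share_examples(idx, index_sets[k]) for k in group):
                group.append(i)
                break
        else:
            groups.append([i])  # no conflict-free group exists yet, so open a new one
    return groups

# Four agents; agents 0 and 2 are disjoint, as are agents 1 and 3.
print(parallel_schedule([[0, 1], [1, 2], [3], [0, 3]]))  # prints [[0, 2], [1, 3]]
```

Because agents in the same group touch disjoint message variables and distinct functions, their updates do not interfere with one another.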

In this paper, we have focused exclusively on regularized kernel 
least-squares regression. However, using Bregman's algorithm [4],
the method and many of the theorems may be extended to 
more general loss functions and regularizers including Bregman 
divergences. 

Those familiar with LDPC codes or Bayes networks may find
the current model and algorithm reminiscent of message-passing
algorithms such as belief propagation, which are frequently studied
in those fields; variational interpretations of kernel methods in
the context of Gaussian processes further suggest a relationship
between these works. Formalizing such a connection would likely
require one to interpret our "relaxation" in the context of dependency
structures in Gaussian processes, and to connect alternating
projection algorithms with the generalized distributive law [1].